The Maximum Exposure Problem
Given a set of points P and a set of axis-aligned rectangles R in the plane, a point p in P is called exposed if it lies outside all rectangles in R. In the max-exposure problem, given an integer parameter k, we want to delete k rectangles from R so as to maximize the number of exposed points. We show that the problem is NP-hard and, assuming plausible complexity conjectures, also hard to approximate even when the rectangles in R are translates of two fixed rectangles. However, if R consists only of translates of a single rectangle, we present a polynomial-time approximation scheme. For general rectangle range spaces, we present a simple O(k) bicriteria approximation algorithm; that is, by deleting O(k^2) rectangles, we can expose at least an Omega(1/k) fraction of the optimal number of points.
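As a toy illustration of the problem statement (not the paper's algorithm or its approximation scheme), the following brute-force sketch enumerates every way of deleting k rectangles; the function names are mine:

```python
from itertools import combinations

def exposed_count(points, rects):
    # A point is exposed if it lies outside every remaining rectangle.
    def inside(p, r):
        (x, y), (x1, y1, x2, y2) = p, r
        return x1 <= x <= x2 and y1 <= y <= y2
    return sum(all(not inside(p, r) for r in rects) for p in points)

def max_exposure_bruteforce(points, rects, k):
    # Try every way of deleting k rectangles; exponential, illustration only.
    best = 0
    for kept in combinations(rects, len(rects) - k):
        best = max(best, exposed_count(points, list(kept)))
    return best
```

With two points each covered by its own rectangle, deleting one rectangle exposes one point; deleting both exposes both.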
JanusAQP: Efficient Partition Tree Maintenance for Dynamic Approximate Query Processing
Approximate query processing over dynamic databases, i.e., under
insertions/deletions, has applications ranging from high-frequency trading to
internet-of-things analytics. We present JanusAQP, a new dynamic AQP system,
which supports SUM, COUNT, AVG, MIN, and MAX queries under insertions and
deletions to the dataset. JanusAQP extends static partition tree synopses,
which are hierarchical aggregations of datasets, into the dynamic setting. This
paper contributes new methods for: (1) efficient initialization of the data
synopsis in the presence of incoming data, (2) maintenance of the data synopsis
under insertions/deletions, and (3) re-optimization of the partitioning to
reduce the approximation error. JanusAQP reduces the error of a
state-of-the-art baseline by more than 60% using only 10% storage cost.
JanusAQP can process more than 100K updates per second in a single-node setting
while keeping query latency at the millisecond level.
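The core idea of a partition-tree synopsis with per-node aggregates can be sketched in one dimension. This is a minimal illustration with invented names, not JanusAQP's actual API: each node of a fixed binary partition stores COUNT and SUM, insertions and deletions update a single root-to-leaf path, and a range SUM is exact on fully covered nodes while partially covered leaves contribute under a uniformity assumption:

```python
class PartitionTreeSynopsis:
    """Fixed binary partition of [lo, hi) with per-node COUNT and SUM
    (AVG would be SUM/COUNT).  Names are illustrative, not JanusAQP's."""

    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        size = 2 ** (depth + 1)          # implicit heap layout, root at index 1
        self.count = [0] * size
        self.sum = [0.0] * size

    def _update(self, key, value, sign):
        # An insertion/deletion touches one root-to-leaf path.
        node, lo, hi = 1, self.lo, self.hi
        while True:
            self.count[node] += sign
            self.sum[node] += sign * value
            if node >= 2 ** self.depth:  # leaf reached
                break
            mid = (lo + hi) / 2.0
            if key < mid:
                node, hi = 2 * node, mid
            else:
                node, lo = 2 * node + 1, mid

    def insert(self, key, value):
        self._update(key, value, +1)

    def delete(self, key, value):
        self._update(key, value, -1)

    def range_sum(self, qlo, qhi):
        # Exact on fully covered nodes; a partially covered leaf contributes
        # proportionally to the overlap (uniformity assumption -> approximate).
        def go(node, lo, hi):
            if qhi <= lo or hi <= qlo:
                return 0.0
            if qlo <= lo and hi <= qhi:
                return self.sum[node]
            if node >= 2 ** self.depth:
                overlap = min(hi, qhi) - max(lo, qlo)
                return self.sum[node] * overlap / (hi - lo)
            mid = (lo + hi) / 2.0
            return go(2 * node, lo, mid) + go(2 * node + 1, mid, hi)
        return go(1, self.lo, self.hi)
```

Updates cost O(depth); queries touch O(depth) nodes per boundary, which is how the synopsis trades a small approximation error for update and query speed.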
A Fair and Memory/Time-efficient Hashmap
There is a large body of work on constructing hashmaps that minimize the number
of collisions. However, to the best of our knowledge, no known hashing technique
guarantees group fairness among different groups of items. We are given a set of
tuples in R^d, for a constant dimension d, and a set of m groups such that every
tuple belongs to a unique group. We formally define the fair hashing problem,
introducing the notions of single fairness (the collision probability
conditioned on each group), pairwise fairness (the collision probability
conditioned on each pair of groups), and the well-known overall collision
probability. The goal is to construct a hashmap such that the collision
probability, the single fairness, and the pairwise fairness are all close to
1/n, where n is the number of buckets in the hashmap.
We propose two families of algorithms to design fair hashmaps. First, we
focus on hashmaps with optimum memory consumption that minimize unfairness. We
model the input tuples as points in R^d, and the goal is to find a vector w such
that projecting the points onto w creates an ordering that is convenient to
split into a fair hashmap. For each projection, we design efficient algorithms
that find near-optimum partitions into exactly (or at most) n buckets. Second,
we focus on hashmaps with optimum fairness (zero unfairness), minimizing the
memory consumption. We make the important observation that the fair hashmap
problem reduces to the necklace splitting problem. By carefully implementing
algorithms for solving the necklace splitting problem, we propose faster
algorithms constructing hashmaps with zero unfairness using a bounded number of
boundary points.
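To make the fairness notions concrete, here is a small empirical sketch that measures the overall and per-group collision probabilities of a given hash function over a finite key set. The function names, and the reading of "single fairness" as the collision probability within each group, are my assumptions:

```python
from itertools import combinations
from collections import defaultdict

def collision_probability(keys, h, n_buckets):
    # Fraction of unordered pairs of distinct keys that collide under h.
    pairs = list(combinations(keys, 2))
    return sum(h(a) % n_buckets == h(b) % n_buckets for a, b in pairs) / len(pairs)

def single_fairness(items, h, n_buckets):
    # Per-group collision probability; items is a list of (key, group) tuples.
    # A fair hashmap keeps every group's value close to 1/n_buckets.
    by_group = defaultdict(list)
    for key, group in items:
        by_group[group].append(key)
    return {g: collision_probability(ks, h, n_buckets)
            for g, ks in by_group.items() if len(ks) >= 2}
```

A hashmap can look fair in aggregate while one group suffers nearly all collisions, which is exactly the gap between the overall collision probability and the per-group notions.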
Computing Shortest Paths in the Plane with Removable Obstacles
We consider the problem of computing a Euclidean shortest path in the presence of removable obstacles in the plane. In particular, we have a collection of pairwise-disjoint polygonal obstacles, each of which may be removed at some cost c_i > 0. Given a cost budget C > 0 and a pair of points s, t, which obstacles should be removed to minimize the path length from s to t in the remaining workspace? We show that this problem is NP-hard even if the obstacles are vertical line segments. Our main result is a fully polynomial-time approximation scheme (FPTAS) for the case of convex polygons. Specifically, we compute a (1 + epsilon)-approximate shortest path in time O((nh/epsilon^2) log n log(n/epsilon)) with removal cost at most (1 + epsilon)C, where h is the number of obstacles, n is the total number of obstacle vertices, and epsilon in (0, 1) is a user-specified parameter. Our approximation scheme also solves a shortest path problem for a stochastic model of obstacles, where each obstacle's presence is an independent event with a known probability. Finally, we also present a data structure that can answer s-t path queries in polylogarithmic time, for any pair of points s, t in the plane.
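The budgeted-removal idea can be illustrated on a simplified grid model (single-cell obstacles with integer removal costs, not the paper's polygonal setting or its FPTAS): run Dijkstra over states (cell, budget spent), paying an obstacle's removal cost on entry; all names here are mine:

```python
import heapq

def shortest_path_with_removals(grid, costs, start, goal, budget):
    # grid[r][c] == 1 marks an obstacle cell; costs[r][c] is its removal cost.
    # State = (cell, budget spent); entering an obstacle cell spends its cost.
    R, C = len(grid), len(grid[0])
    dist = {}
    pq = [(0, start[0], start[1], 0)]     # (path length, row, col, spent)
    while pq:
        d, r, c, spent = heapq.heappop(pq)
        if (r, c) == goal:
            return d
        if dist.get((r, c, spent), float("inf")) < d:
            continue                      # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < R and 0 <= nc < C):
                continue
            extra = costs[nr][nc] if grid[nr][nc] else 0
            if spent + extra > budget:
                continue                  # removal would exceed the budget C
            state = (nr, nc, spent + extra)
            if d + 1 < dist.get(state, float("inf")):
                dist[state] = d + 1
                heapq.heappush(pq, (d + 1, nr, nc, spent + extra))
    return None                           # goal unreachable within budget
```

With a wall of cost-5 cells between s and t, a budget of 5 cuts straight through, while a budget of 0 forces the longer detour.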
Computing Data Distribution from Query Selectivities
We are given a set of n ranges in R^d, such as rectangles or balls, each
annotated with its \emph{selectivity}. The goal is to compute a small-size
\emph{discrete data distribution}: a set of points in R^d with non-negative
weights summing to 1, such that the distribution is the most \emph{consistent}
with the given selectivities, i.e., the average squared difference between each
observed selectivity and the total weight of the points falling inside the
corresponding range is minimized. In a database setting, the input corresponds
to a workload of range queries over some table, together with their observed
selectivities (i.e., fraction of tuples returned), and the output distribution
can be used as a compact model for approximating the data distribution within
the table without accessing the underlying contents.
In this paper, we obtain both upper and lower bounds for this problem. In
particular, we show that the problem of finding the best data distribution from
selectivity queries is computationally hard. On the positive side, we describe
a Monte Carlo algorithm that constructs, in time polynomial in the number of
ranges for any fixed dimension, a discrete distribution of small size whose
error is within an additive term of the minimum error achievable by any
discrete distribution. We also establish conditional lower bounds, which
strongly indicate the infeasibility of relative approximations, as well as of
removing the exponential dependency on the dimension for additive
approximations. This suggests that significant improvements to our algorithm
are unlikely.
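A crude one-dimensional sketch of the fitting task (not the paper's Monte Carlo algorithm): fix a candidate point set, then fit simplex weights by gradient steps with clip-and-renormalize so that the induced selectivities match the observed ones in squared error. All names and the projection heuristic are mine:

```python
def fit_weights(ranges, sels, points, iters=2000, lr=0.1):
    # ranges: list of 1-d intervals (lo, hi); sels: observed selectivities;
    # points: fixed candidate support.  Returns simplex weights minimizing
    # the average squared selectivity error (gradient + clip/renormalize).
    n, m = len(ranges), len(points)
    inc = [[lo <= p <= hi for p in points] for lo, hi in ranges]  # membership
    w = [1.0 / m] * m
    for _ in range(iters):
        pred = [sum(wj for wj, inside in zip(w, row) if inside) for row in inc]
        grad = [(2.0 / n) * sum(pred[i] - sels[i] for i in range(n) if inc[i][j])
                for j in range(m)]
        w = [max(0.0, wj - lr * g) for wj, g in zip(w, grad)]
        total = sum(w) or 1.0             # crude projection back to the simplex
        w = [wj / total for wj in w]
    return w

def selectivity_error(ranges, sels, points, w):
    # err(D, Z): average squared gap between observed and induced selectivity.
    preds = [sum(wj for wj, p in zip(w, points) if lo <= p <= hi)
             for lo, hi in ranges]
    return sum((s - q) ** 2 for s, q in zip(sels, preds)) / len(ranges)
```

For two disjoint intervals with selectivities 0.25 and 0.75 and one candidate point in each, the fitted weights converge to (0.25, 0.75).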
Efficient Algorithms for k-Regret Minimizing Sets
A regret minimizing set Q is a small-size representation of a much larger database P, so that user queries executed on Q return answers whose scores are not much worse than those on the full dataset. In particular, a k-regret minimizing set has the property that the regret ratio between the score of the top-1 item in Q and the score of the top-k item in P is minimized, where the score of an item is the inner product of the item's attributes with a user's weight (preference) vector. The problem is challenging because we want to find a single representative set Q whose regret ratio is small with respect to all possible user weight vectors.
We show that k-regret minimization is NP-complete for all dimensions d >= 3, settling an open problem from Chester et al. [VLDB 2014]. Our main algorithmic contributions are two approximation algorithms, both with provable guarantees: one based on coresets and another based on hitting sets. We perform an extensive experimental evaluation of our algorithms, using both real-world and synthetic data, and compare their performance against the solution proposed in [VLDB 2014]. The results show that our algorithms are significantly faster and scale to much larger sets than the greedy algorithm of Chester et al., for comparable quality answers.
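The regret-ratio objective itself is easy to state in code. A small evaluation sketch (my names, not the paper's coreset or hitting-set algorithms) that scores a candidate set Q against a sample of user weight vectors:

```python
def regret_ratio(P, Q, weight, k=1):
    # 1 - (best score in Q) / (k-th best score in P); score = dot product
    # of an item's attributes with the user's weight vector.
    score = lambda item: sum(a * w for a, w in zip(item, weight))
    top_q = max(score(q) for q in Q)
    kth_p = sorted((score(p) for p in P), reverse=True)[k - 1]
    return max(0.0, 1.0 - top_q / kth_p)

def max_regret(P, Q, weights, k=1):
    # Worst case over a sample of weight vectors; the true objective
    # maximizes over all possible user weight vectors.
    return max(regret_ratio(P, Q, w, k) for w in weights)
```

A set Q that keeps only one extreme point has regret 0 for the matching weight vector but regret 1 for the orthogonal one, which is why a single Q must hedge across all directions.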
Approximating Distance Measures for the Skyline
In multi-parameter decision making, data is usually modeled as a set of points whose dimension is the number of parameters, and the skyline or Pareto points represent the possible optimal solutions for various optimization problems. The structure and computation of such points have been well studied, particularly in the database community. As the skyline can be quite large in high dimensions, one often seeks a compact summary. In particular, for a given integer parameter k, a subset of k points is desired which best approximates the skyline under some measure. Various measures have been proposed, but they mostly treat the skyline as a discrete object. By viewing the skyline as a continuous geometric hull, we propose a new measure that evaluates the quality of a subset by the Hausdorff distance of its hull to the full hull. We argue that in many ways our measure more naturally captures what it means to approximate the skyline.
For our new geometric skyline approximation measure, we provide a plethora of results. Specifically, we provide (1) a near-linear time exact algorithm in two dimensions, (2) APX-hardness results for dimensions three and higher, (3) approximation algorithms for related variants of our problem, and (4) a practical and efficient heuristic which uses our geometric insights into the problem, as well as various experimental results to show the efficacy of our approach.
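For intuition on the underlying object, here is the standard two-dimensional skyline (Pareto set) computation under maximization, the discrete object whose geometric hull the paper's Hausdorff-based measure then approximates; the sketch and its names are mine:

```python
def skyline(points):
    # Pareto-maximal points in 2-d (maximizing both coordinates): sweep in
    # decreasing x order, keeping points whose y beats the running maximum.
    best_y = float("-inf")
    out = []
    for x, y in sorted(points, reverse=True):
        if y > best_y:
            out.append((x, y))
            best_y = y
    return out
```

The sweep runs in O(n log n) time; in higher dimensions the skyline can be much larger, which is what motivates the compact k-point summaries studied in the paper.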